Moses Kamoga
The overall objective of this project is to analyze house data, from the square footage to the location of a house, to better predict the price of a house in the state of Washington. The aim is not only to give house buyers an estimate of what their budget should entail if they are looking to buy a house in a particular area, but also to give them useful information on where to get a house within their budget in a great location.
Throughout this tutorial we will fully analyze all the variables in this dataset and determine whether they are highly significant in explaining the change in the price of a house in Washington state. We will also offer more insights into the data by looking at the most and least expensive cities in Washington state, so that we can properly guide buyers and sellers alike in their quest to maneuver the real estate market in Washington.
1.0 DATA COLLECTION
1.1) Source of the dataset
1.2) Variables in the dataset
1.3) Background for the interest in this topic
2.0 SETTING THIS UP
2.1) Needed python libraries
2.2) Importing the dataset
2.3) View data
2.4) Clean data
3.0 DATA VISUALIZATION
3.1) Checking for correlation between the response variable(price) and independent variables
3.2) Generating the scatterplots
3.3) Analysing scatterplots
4.0 PERFORMING REGRESSION ON THE DATA
4.1) Analysing the regression model
4.2) Newly fitted model
4.3) Doing the F-Partial Test
4.4) Regression on the Final Model
5.0 FINAL REGRESSION EQUATION FOR THE MODEL
6.0 PUTTING THE FINAL MODEL TO THE TEST
6.1) Confidence interval
7.0 EXPLORING THE MOST EXPENSIVE CITY TO PURCHASE A HOUSE IN WASHINGTON STATE
7.1) Generating the desired table to be analysed and visualized
7.2) Bar chart to help us visualize the average price of houses in different cities in WA
7.3) Analysing the bar chart.
8.0 MOST EXPENSIVE STREETS IN THE MOST EXPENSIVE CITY IN WA
8.1) Organising the data
8.2) Bar chart to help us visualize the average prices of houses by street in Seattle.
9.0 CONCLUSION
The dataset used was obtained from Kaggle; the link is here: https://www.kaggle.com/shree1992/housedata#data.csv
This dataset has 18 columns, that is:
- date: the date the entry was recorded.
- price: the price of the house.
- bedrooms: the number of bedrooms that a particular house has.
- bathrooms: the number of bathrooms that a house has.
- sqft_living: the number of living square feet of a given house.
- sqft_lot: the size of the lot or floor space of the house.
- floors: the number of floors that a house has.
- waterfront: indicates whether the house is a waterfront property or not. It is a categorical variable, with '0' indicating that the house is not a waterfront property and '1' indicating that it is.
- view: the number of views that the house has. This is a continuous variable.
- condition: the condition that the house is in, from 1 to 4, with 1 indicating that the house is not in good condition and 4 indicating that the house is in great condition.
- sqft_above: the square feet of the house above ground.
- sqft_basement: the square feet of the basement, if the property has one. If the house has no basement, this variable is 0.
- yr_built: the year the house was built.
- yr_renovated: the year the house was renovated.
- street: the street the house is built on.
- city: the city where the particular house is found.
- statezip: the state and zip code of a particular house.
- country: the country in which the house is built.
1.3) Background of why I am interested in developing a model to predict the price of houses in Seattle
I have always been fascinated by models that predict future trends. This fascination has over time turned into an obsession, and I knew that, given the opportunity, knowledge, and resources, I would develop a model to predict future trends. Being a college senior, the time is nearing when I am supposed to look at houses to buy for my future family or for investment purposes. So what better way to prepare myself for this journey through the muddy waters of real estate than to understand the underlying factors that influence the price of a house? For this tutorial my focus is Washington state.
- Pandas: used for data display and partitioning
- Matplotlib - pyplot: used for plotting Pandas data into graphs and charts
- Seaborn: provides a high-level interface for graphics on top of Matplotlib
- scikit-learn: very popular machine learning library
- linear-model: used to calculate models for multiple linear regression
- model_selection: used to split up dataset into test and training data and evaluate predictions
- statsmodels - api: used to calculate models and statistics with multiple linear regression
We import the dataset from Kaggle by manually downloading it onto the local computer, storing it as a .csv in the folder where this notebook file is located, and then using pandas' read_csv to read the file into the notebook so that we can carry out the analysis, as shown below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from PIL import Image
import scipy.stats as stats
import math
#Importing the house price prediction dataset in our python project for analysis
df = pd.read_csv('data.csv')
#The dimensions (rows, columns) of our dataset.
print(df.shape)
I will be carrying out the analysis on 4600 rows of data. This is more than enough rows for us to be confident in our findings and the conclusions we will reach at the end of this tutorial.
#Below is what the dataset table looks like
df.head()
Our goal is to predict the price of a house in Washington state. Just by eyeballing the data, we can immediately see which columns we do not need in our analysis. We definitely do not need the date column: there is no way it makes sense to use the date the entry was recorded to predict the price of a house, so the date column has to be eliminated from the dataset. Inspecting the dataset further, the country column is also unnecessary, since all the houses in the dataset are from the USA and, more importantly, the analysis we are doing is for houses sold in Washington state, which is in the USA.
In conclusion, both the first column and the last column, which are date and country respectively, have to be deleted. I do not need to carry out any tests; common sense prevails here and saves me a lot of time.
#deleting the first column from the dataset.
df = df.drop(['date'], axis = 1)
#deleting the last column from the dataset
df = df.drop(['country'], axis = 1)
#The new look dataset
df.head()
From 18 columns we initially started with, we are now down to 16 columns. I will do the analysis on these 16 columns and from these, we will determine which variables are significant in explaining an increase or decrease in the price of the house.
Determining the correlation between the response variable and the predictors gives us insight into which variables strongly influence price. Using scatter plots, we can tell whether a variable is highly correlated with price simply by eyeballing the graphs and judging the level of correlation between the price and the variable.
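A quick numeric check can complement the visual one. The sketch below is a minimal example on a small synthetic frame (the `demo` DataFrame and its columns are hypothetical); with the real dataset you would call `corr` on the numeric columns of `df` instead.

```python
import numpy as np
import pandas as pd

# Minimal sketch on synthetic data: Pearson correlations of each numeric
# predictor with the response. With the real dataset, replace `demo` with df.
rng = np.random.default_rng(1)
demo = pd.DataFrame({'sqft_living': rng.uniform(500, 4000, 100)})
demo['price'] = 200 * demo['sqft_living'] + rng.normal(0, 10000, 100)

# corr(numeric_only=True) skips non-numeric columns such as street or city.
corr_with_price = demo.corr(numeric_only=True)['price'].sort_values(ascending=False)
print(corr_with_price)
```

Correlations close to 1 or -1 flag variables worth a closer look in the scatter plots below.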
I use matplotlib to generate the subplots that I want to create. In the instance below, I generate a grid of 5 rows and 3 columns of subplots. Then I create an array and store in it all the column names of the dataset except price. This array comes in handy when I generate the pivot tables that I plot on the subplots. Each pivot table has price, the response variable, on the y-axis and, on the x-axis, the variable obtained by iterating over the array. When I execute data.pivot_table().plot(), this populates the individual graphs as line plots unless specified otherwise in the attributes.
#making the scatter plots
fig, axis = plt.subplots(5, 3, figsize=(20, 20))
#collect every column name except the response variable 'price'
columns = []
for column in df.columns:
    if column == 'price':
        continue
    columns.append(column)
#plot average price against each predictor on its own subplot
for column, ax in zip(columns, axis.flat):
    df.pivot_table('price', index=column).plot(ax=ax)
We learn a lot about our data just by eyeballing the graphs. Let us dissect the graphs one by one to get a better understanding of how each variable relates to our response variable, 'price'.
1. price vs bedroom graph
From this graph we can see that the price is high at 0, decreases towards 1, increases from 1, and then increases again from 9 and above. This does not make sense at all: we expect the price of a house to increase as the number of bedrooms increases. This abnormality can be attributed to other factors, such as the house being a waterfront property or located in a very expensive zipcode. Later in this tutorial we will be able to prove this.
2. price vs bathrooms
This graph behaves almost the same way as the graph of price vs bedrooms. Having many bathrooms in a house does not necessarily mean that its price will go up. This can be attributed to the same factors we discussed for 'price vs bedrooms'. We will analyse this in more detail, and perhaps our evidence will support the conclusions here.
3. price vs sq_ft living, price vs sqft_lot graphs
These also follow the same trend as the previous graphs. Later on, after doing more analysis, we will be able to make final conclusions.
4. price vs floor
This graph follows the others, in that a higher number of floors does not necessarily mean that the price of the house will be high. Other factors come into play, such as where the house is located.
5. price vs waterfront
The more waterfront frontage a house has, the higher its price. We can see that this graph is highly correlated: the line keeps rising as the waterfront value increases. We will prove this later, and the regression analysis we do on this data should produce a large positive coefficient to account for this steep rise of the line.
6. price vs view
This graph is not as steep as 'price vs waterfront' but is also correlated with price. As the number of views increases, the price of the house tends to increase, though there are a few instances where that is not the case.
7. price vs condition
We all believe that the better the condition of the house, the higher the price is going to be. The graph clearly indicates that if the condition of the house is good, it is going to cost a little more. There are a few instances that do not follow the norm, but the regression later will enable us to explain these discrepancies or abnormalities in our graph.
8. price vs sqft_above, price vs sqft_basement.
These graphs do not clearly show whether the variables are correlated with price. We do observe random spikes, but they do not tell us whether there is a correlation. More analysis will definitely give us a clearer picture of what is going on here.
9. price vs yr_built, price vs yr_renovated
Both these graphs may not really correlate with price at all. This means that the year a house was built or renovated may not have an effect on its price. We cannot really ascertain this without doing extra tests, but from eyeballing the graphs we can safely say so for now.
10. price vs street, price vs city, price vs statezip
Street, city, and statezip are not numerical variables, and thus just eyeballing the graphs does not tell us anything about their relationship with the price of the house. We will need a lot more analysis and a lot more visuals to clearly say how price is influenced by the street the house is built on, the city in which the house is built, and the statezip where the house is located.
After eyeballing the graphs above and making assumptions and conclusions from them, it is time to put those assumptions and conclusions to the test. We will achieve this by running a linear regression on the dataset; with real proof, we can ascertain whether our assumptions above are indeed true or were completely off the mark. This will also help us understand the variables about which we could not say much earlier. Before we do a regression on our data, we will remove three variables, statezip, city, and street, as we suspect from the graphs above that the price of a house is influenced by the city, street, and statezip where it is located. With this we can be sure that the location of the house affects its price either positively or negatively.
This is due to other factors that are not explained in the model, such as:
- whether the neighbourhood is safe
- whether the neighbourhood has great schools
- whether the neighbourhood is in a prime location
- the cost of living in the particular state, which also influences the price of the house: if the cost of living is high, house prices will be high too, and vice versa.
I will do a regression on the rest of the variables to determine how significantly they influence the price of a house.
#Generating a regression model for the data.
newTable = df.copy()
#dropping the 3 variables
newTable = newTable.drop(['street', 'city', 'statezip'], axis=1)
result = sm.ols(formula="price ~ bathrooms + bedrooms + condition + floors + sqft_above + sqft_basement + sqft_living + sqft_lot + view + waterfront + C(yr_built) + C(yr_renovated) ", data=newTable).fit()
print(result.summary())
The coefficients tell us whether a variable decreases or increases the price of a house: with a negative coefficient, an increase in the units of the variable decreases the price of the house, and with a positive coefficient, an increase in the units of the variable leads to an increase in the price of the house.
From the model above, we can tell that bathrooms, condition, floors, sqft_above, sqft_basement, sqft_living, view, waterfront, and yr_renovated are significant in explaining an increase in the price of a house.
Bedrooms and yr_built, on the other hand, are significant in explaining a decrease in the price of a house. With yr_built, we can use common sense to conclude that houses that are old, or were built long ago, are somewhat more expensive than houses built recently. To understand why bedrooms negatively affect the price, we could argue that houses in the city center, that is, apartments with fewer bedrooms, are more expensive than houses in the suburbs, because their location is very appealing to buyers.
Now we look at the p-values. These values enable us to determine how significant a variable is in explaining the change in the price of a house. If the p-value is approximately zero, the variable is significant in explaining the change (increase or decrease) in the price of a house. From the model above, we can see that all the variables are significant in explaining the change in the price of a house except yr_renovated.
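As a minimal illustration of reading coefficients and p-values off a fitted statsmodels OLS model, here is a sketch on synthetic data (the `demo` frame and its column names are hypothetical; with the real model you would index `result.params` and `result.pvalues` by the variable names above):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic data: price depends strongly on sqft, not on beds.
rng = np.random.default_rng(2)
demo = pd.DataFrame({'beds': rng.integers(1, 6, 300),
                     'sqft': rng.uniform(500, 3000, 300)})
demo['price'] = 150 * demo['sqft'] + rng.normal(0, 5000, 300)

fit = sm.ols('price ~ beds + sqft', data=demo).fit()
# params holds the estimated coefficients, pvalues their significance.
print(fit.params['sqft'])    # close to the true coefficient of 150
print(fit.pvalues['sqft'])   # approximately zero: sqft is significant
```

A coefficient with a p-value near zero, like `sqft` here, is the kind of variable we keep; one with a large p-value is a candidate for removal.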
To answer the question of how good this model is, we look at the R-squared. The value of R-squared is so small that this model is not significant in explaining the response in price.
Perhaps the 3 variables we removed are responsible for the low R-squared value.
We fit a new model with all the original variables.
Fitting a new model including the city, statezip, and street. In the previous model, we can also see that the p-values for yr_built and yr_renovated were too high, so we can remove them from the newly fitted model. The p-value for floors is also too high, so we remove it from the new model. Furthermore, we know that statezip and city are not really needed, since the street address of a house already provides that information: from the street address we can derive the city, state, and zipcode. So the city and statezip columns will be removed to eliminate redundancy in our model.
#Copying the content of the dataset.
newModel = df.copy()
#fitting the new model without the variables floors, city and state with reasons given above but including street.
result_new_Model = sm.ols(formula="price ~ bathrooms + bedrooms + condition + sqft_above + sqft_basement + sqft_living + sqft_lot + view + waterfront + street ", data=newModel).fit()
#print(result_new_Model.summary())
R-squared is 99%, which is very high and shows that the model is significant in explaining the change in price. This is what we were looking for, but we still have work to do, since there are variables with a high p-value. If the p-value is not approximately zero, that particular variable is not significant in explaining the change in price. Thus, we have to remove these variables so that we remain with only those that are significant in explaining the change in price.
The variables in question are:
- waterfront, which has a p-value of 0.117
- sqft_basement, which has a p-value of 0.180
- condition, which has a p-value of 0.969
- bedrooms, which has a p-value of 0.932
The Adj. R-squared accounts for predictors that are not significant in explaining the change in the response variable.
We do the partial F-test to prove that our new model is better at explaining the change in the price of a house than the old model. If the p-value is approximately zero, this lets us know that our new model is better.
Hypotheses
Null hypothesis: the original model with the variables waterfront, sqft_basement, condition, and bedrooms explains the change in price better than the smaller model.
Alternative hypothesis: the new model without the variables waterfront, sqft_basement, condition, and bedrooms explains the change in price better than the full model.
F-Partial test
The p-value is approximately zero, and thus we reject the null hypothesis.
With this test, we can proceed to fit the new (smaller) model.
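In statsmodels, a partial F-test between two nested fitted models can be run with `compare_f_test`, called on the full model with the restricted model as argument. Below is a minimal sketch on synthetic data (the variable names `x1`, `x2`, `y` are hypothetical stand-ins for our predictors):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic data: x2 has no real effect on y.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
data['y'] = 2 * data['x1'] + rng.normal(size=n)

full = sm.ols('y ~ x1 + x2', data=data).fit()
restricted = sm.ols('y ~ x1', data=data).fit()

# compare_f_test is called on the full model, passing the restricted one;
# it returns the F statistic, the p-value, and the difference in degrees
# of freedom between the two models.
f_stat, p_value, df_diff = full.compare_f_test(restricted)
print(f_stat, p_value, df_diff)
```

With the real models, `full` and `restricted` would be the two fitted results being compared above.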
We fit a final model with the variables above removed.
#Fitting the new model
newModelTable = df.copy()
result_NewModel_Table = sm.ols(formula="price ~ bathrooms + sqft_above + sqft_lot + view + C(waterfront) + C(street)", data=newModelTable).fit()
#print(result_NewModel_Table.summary())
In this model, R-squared is 99%, which means that the model is significant in explaining the change in price. Although there is no dramatic change in the Adj. R-squared, there is a slight increase of 0.2.
Below is the equation for the model that can be used to predict the price of the house given the independent variables in the model.
price_estimate = -4.768e+05 + 3.5e+04(waterfront) + 1.13e+06(10 W Etruria St) + .... + 1.15e+05(Valley View Trail) - 1.512e+05(bathrooms) + 519.1622(sqft_above) + 53.2727(sqft_lot) + 3.146e+05(view)
where:-
The streets are categorical variables; there are around 4600 of them in our regression table, so I am not able to list all of them here. Normally all the streets are listed in the final equation, and if a house belongs on a particular street, that street is set to 1 and all the other streets are set to 0.
Waterfront is also a categorical variable: if the house is located on a waterfront it is set to 1, and if not, it is set to 0.
We now put this model to the test. Having removed the variables that do not significantly explain the change in price, we test our generated model by taking a real listing from Washington state and seeing what our model predicts. Below is a real listing of a house in Seattle which has been on the market recently and just sold for 615,000 dollars.
Link for the listing --> https://www.zillow.com/homedetails/2022-24th-Ave-E-Seattle-WA-98112/49098189_zpid/?
#Displayed is a real listing that sold on 2 December 2019
display(Image.open("Screen Shot 2019-12-09 at 1.14.19 PM.png"))
As pointed out earlier, the listing gives us the street, state, city, and zipcode of where the house is located. We can plug the variables from the listing into the formula above.
The street is ---> 24th Ave E, Seattle, WA 98112, and from the model the coefficient is ---> 6.979e+05
waterfront ---> the house is not located on a waterfront. So, this variable takes a zero.
The number of bathrooms is ----> 2
Sqft_above ---> 430; this is calculated by subtracting the basement sqft (430 sqft) from the total sqft (860 sqft).
Sqft_lot ---> 3088
view ---> the number of views that the house has; in this case, it is 1.
We plug these values into our formula to estimate the price of the house.
price_Estimate = -4.768e+05 + (3.5e+04*0) + (6.979e+05 * 1) - (1.512e+05 * 2) + (53.2727 * 3088) + (3.146e+05 * 1) + (519.1622 * 430)
print(price_Estimate)
Our model estimates the price of the house to be 621,045.84. The house actually sold for 615,000. We can be confident in our model predicting the price of a house with minimal error.
#The difference between our model's estimate and the actual price the house went for is
diff_price = 621045.8436 - 615000
print(diff_price)
A price estimate like the one we generated above for a house in Seattle gives us a rough idea of what the actual price will be. But we know such estimates are prone to error, and running multiple samples is not efficient either. A confidence interval is a range of values above and below a point estimate that captures the true population parameter at some predetermined confidence level.
We will calculate the 95% confidence interval, to be certain that the true value of the price of the listed house lies within that interval.
With 95% confidence,
We use stats.norm.ppf(q = 0.975) to get the desired z-critical value, instead of q = 0.95 (95%), because the distribution has two tails. We then calculate the standard deviation and the margin of error.
#calculating the critical value.
z_critical = stats.norm.ppf(q = 0.975)
# Check the z-critical value
print("z-critical value:")
print(z_critical)
price_stdev = df['price'].std() # Get the standard deviation of price in the dataset
margin_of_error = z_critical * (price_stdev/math.sqrt(len(df)))
confidence_interval = (price_Estimate - margin_of_error,
price_Estimate + margin_of_error)
print(price_stdev)
print(confidence_interval)
Having predicted the price of a house in Seattle, we will now look at the most expensive cities to live in within Washington state. We will do this by calculating the average price of houses in each city. This can be achieved using groupby with an aggregate function of mean: it adds the prices of all houses in a city and divides by the total number of houses in that city to get the average house price for that particular city in our dataset.
#getting the average price of houses in a given city
sum_Price_Per_City = df.groupby(['city'])['price'].agg('mean').reset_index()
sum_Price_Per_City.columns = ['CITY','AVG_HOUSE_PRICES_CITY']
#displaying only 20 cities here!
sum_Price_Per_City.head(20)
#graphing the bar chart
plt.figure(figsize=(15,10))
plt.title("Average price of houses in a given city in Washington State", fontsize=16)
sns.barplot(data=sum_Price_Per_City, x='AVG_HOUSE_PRICES_CITY', y='CITY')
plt.show()
From the bar chart we can immediately observe the disparities in house prices across Washington state. Seattle by far has the highest average house price. This can be attributed to the booming technology sector in the area: big tech giants, the likes of Amazon and Microsoft, are headquartered there, which has led to exorbitant house prices.
According to a Seattle Times online editorial authored by Rosenberg, the prices of houses in Seattle are expected to go even higher with a looming Amazon second HQ. The only city that comes close to Seattle is Bellevue. Its high average house price can also be attributed to Amazon, because Bellevue was Amazon's first home before they moved to Seattle. Amazon still maintains a presence in Bellevue, and according to a Business Insider article authored by Chan, "Amazon is inching its way back to its original home city". They are planning to build their tallest tower in that city and move all their worldwide operations there.
This means that now is the right time to buy property in Bellevue, before everything goes crazy like what happened in Seattle. As a buyer looking to buy property in Washington state, even though the prices are a little high, Bellevue will be a good buy because the future looks bright. And if you are planning to sell and are living in that area, my advice would be to hold onto your house and wait for the rainfall, that is, Amazon's second coming.
Given that Seattle is by far more expensive than all the other cities in Washington state, let's explore the expensive and cheap neighbourhoods in that city. That is, if someone really wanted to buy a house in Seattle, which neighbourhoods or streets should they look at that they can afford?
To plot the bar chart that will enable us to visualize and understand the pricing of houses on different streets in Seattle, we will extract all the rows where the 'city' column equals Seattle. Then we will carry out the calculations much the same way we did when finding the average price of houses in Washington state grouped by city, but this time we will group by street.
Given that there are a lot of streets in Seattle, having every street on the chart would not let us fully analyze the average prices on every street and make informed decisions. So we will plot the top 75 most expensive streets/neighbourhoods in Seattle: we will sort the values in descending order, take the top 75 most expensive streets (streets with a high average price), and plot those for analysis.
#Extracting the rows where city = Seattle from the dataset
seattle_Streets_Table = df[(df['city'] == 'Seattle')]
sum_Price_Per_Street = seattle_Streets_Table.groupby(['street'])['price'].agg('mean').reset_index()
sum_Price_Per_Street.columns = ['Street','AVG_HOUSE_PRICES']
#sorting the values in descending order so that we can get the top 75.
sorted_Price_Per_Street = sum_Price_Per_Street.copy().sort_values('AVG_HOUSE_PRICES', ascending=False)
#table for the top 75 most expensive streets in Seattle.
top_75 = sorted_Price_Per_Street.head(75)
top_75.head()
plt.figure(figsize=(15,15))
plt.title("Average price of houses on a given street in Seattle ", fontsize=16)
sns.barplot(data=top_75, x='AVG_HOUSE_PRICES', y='Street')
plt.show()
From the bar chart, we can observe that the street with the highest average house price is 5426 40th Ave W. The average price is well above 1.2 million, astronomically higher than that of the street with the second-highest average price. This does indicate the price disparity between various streets of Seattle, where one neighbourhood has astronomically higher prices than another.
Our model predicted the price of a house in Seattle within a few thousand dollars of the actual price that the house sold for, as shown by the real listing above. Running the regression model, we realised that most of the variables were not significant in explaining the change in the price of a house, but we also noted that the street on which the house is located was highly significant. This can be attributed to the fact that when we got rid of the street column, R-squared was too low to even do any analysis on the model. Further examination of the 'street' variable through the bar chart gave us a good visual of its significance in our model.